Biological data can be scarce and costly to obtain. The small number of samples
available typically limits statistical power and makes reliable inference of causal
relations extremely difficult. However, we argue that statistical power can be
increased substantially by incorporating prior knowledge and data from diverse
sources. We present a Bayesian framework that combines information from
different sources, and we show empirically that this lets one make correct causal
inferences with small sample sizes that otherwise would be impossible.


1 Introduction and Motivation 

There is a growing interest in the development and application of new
computational methodologies for analyzing genomic and proteomic data, ranging from
clustering techniques 1 to algorithms for inferring regulatory networks. 2,3,4,5,6
However, most such methods concentrate on discovering regularities in individ- 
ual data sets and operate in a knowledge-lean manner. This contrasts sharply 
with the strategies of most biologists, who focus on testing specific hypotheses 
formulated in the context of biological knowledge and previous studies. 

This observation suggests that biologists would benefit from better com- 
putational aids for hypothesis evaluation. Many such tools already exist, but 
their statistical power remains generally weak because, like most computational 
discovery techniques, they focus on data collected from a single study and typ- 
ically ignore available knowledge. In this paper, we demonstrate how one can 
utilize prior biological knowledge to substantially increase the statistical power 
of causal hypothesis evaluation. Along the way, we address a number of chal- 
lenges that this idea raises, including the facts that knowledge may come from 
different sources under different experimental conditions, have varying levels 
of uncertainty, and involve quantities that are not measured directly. 

In the section that follows, we provide a motivating example that describes 
a biological hypothesis and relevant background knowledge. We then present 
a computational framework and associated algorithm that lets us calculate the 
evidence in favor of such a hypothesis given both prior knowledge and data. 
We take a Bayesian approach to hypothesis evaluation, since this paradigm 
provides ready mechanisms for combining data and knowledge from multiple
sources. After this, we report experimental studies with the algorithm on 
synthetic data, to determine its robustness, and a specific biological hypothesis, 
to ensure its relevance. In closing, we review related work on causal models in 
biology and suggest some directions for future research in this area. 

2 A Motivating Example 

Mitogen-activated protein kinase signal transduction pathways process a wide
range of extracellular stimuli to determine a cell’s transcriptional response
to environmental changes or intercellular messages. One example, the c-Jun
NH2-terminal kinase 7 (JNK/SAPK) pathway, responds to growth factors (e.g.,
TGF-β and EGF), cytokines (e.g., TNF and IL-1), and forms of environmental
stress (e.g., osmotic and radiation). It terminates in the phosphorylation
and activation of JUN-family transcription factors, which dimerize with FOS, 
ATF, or other JUN factors to form AP-1 leucine-zipper transcription factor 
complexes, 8 which in turn enhance or repress transcription of many immediate- 
early genes. The JNK pathway has been implicated in many cellular processes 
and pathologies, including embryonic morphogenesis, cancer, immune system 
response, apoptotic signaling, cardiac hypertrophic response, neurodegenera- 
tive disease, and diabetes complications. 9 The JNK pathway is also considered 
a promising intervention point for many pathological conditions. 

In humans and mice, the JNK family of kinases is derived from three
genes, each of which elicits distinct responses under distinct conditions. 10,11
Jnk3 is found almost exclusively in brain, heart, and testes, whereas Jnk1 and
Jnk2 are present in all tissues. 10 JNKs are known to phosphorylate several
components of AP-1 complexes, including c-Jun, JunD, and Atf2, although
the different JNKs differ in their ability to phosphorylate each target. 12,13,10
The JUN family of transcription factors consists of c-Jun, JunB, JunD, and
the viral oncogene v-Jun. They differ in their activation conditions, the AP-1
complexes in which they participate, and their transcriptional targets. To date,
most laboratory studies involving JNK pathways have examined the involvement
of JNK or JUN as a group, rather than looking at specific JNK or JUN variants.
However, understanding the interactions between specific variants is essential
to untangling the functional roles of these pathways. We use as our motivating
example the hypothesis that c-Jun is uniquely activated by Jnk2, and therefore
is not activated by Jnk1. b

b For example, Kallunki et al. 13 - 11 found that Jnk2 binds to c-Jun 25 times more efficiently
than Jnk1, and Gupta 1 found that Jnk2 isoforms tend to have higher affinities for c-Jun
and Atf2 than do Jnk1 isoforms. Until recently, no kinase other than JNKs had been found
capable of phosphorylating c-Jun, 12 but some evidence has emerged that an ERK kinase



3 Representing Background Knowledge, Hypotheses, and Data 

A computational system that evaluates causal biological hypotheses in the
context of background knowledge and data must first represent such knowledge,
data, and hypotheses. We encode these in terms of relations among discrete
variables that can take on the values + (up-regulated), − (down-regulated), or
0 (unchanged) relative to a control condition. Facts and experimental data are
divided into scenarios, with $x_{ij}$ denoting the value of variable $x_i$ in scenario
$j$. When referring to a specific variable by name, we need only the scenario
subscript, e.g., $\text{c-Jun}_j$ for the level of phosphorylated c-Jun in scenario $j$.

Consider the background knowledge that increasing TPA causes IL-11
expression to increase 16 [with other factors held constant]. To encode this, we
assign it to a scenario, say $j = 1$, then define causal conditions,
$C_1 = \{\text{TPA}_1 = +, \text{TNF-}\alpha_1 = 0\}$, and known effects,
$E_1 = \{\text{IL-11}_1 = +\}$. Encoding of experimental data is done in a
similar manner. Consider an experiment in which cells are exposed to an
increased amount of TNF-α with TPA exposure unchanged, and in which expressions
of nur77, FL1, and IL-11 increase, whereas expressions of p19 and p53 decrease.
Assigning this experiment to scenario $j = 2$, we encode the experimental
conditions or interventions as
$I_2 = \{\text{TNF-}\alpha_2 = +, \text{TPA}_2 = 0\}$ and the observed effects as
$D_2 = \{\text{EL\_nur77}_2 = +, \text{EL\_p19}_2 = -, \text{EL\_FL1}_2 = +, \text{EL\_IL-11}_2 = +, \text{EL\_p53}_2 = -\}$.
If Jnk1 were also knocked out, the experimental conditions would be expressed
as $I_3 = \{\text{TNF-}\alpha_3 = +, \text{TPA}_3 = 0, \text{Jnk1}_3 = 0\}$, where
$\text{Jnk1}_3 = 0$ states that Jnk1 is controlled to be unchanged.
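The encoding just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the `Scenario` class, its field names, and the ASCII spelling `TNF-alpha` are hypothetical and are not taken from the implementation discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict

# Qualitative values relative to a control condition.
UP, DOWN, SAME = "+", "-", "0"

@dataclass
class Scenario:
    """One scenario j: exogenously set variables plus known or observed effects."""
    index: int
    conditions: Dict[str, str] = field(default_factory=dict)  # C_j or I_j
    effects: Dict[str, str] = field(default_factory=dict)     # E_j or D_j

# Background fact (scenario 1): raising TPA, other factors constant, raises IL-11.
fact_1 = Scenario(1,
                  conditions={"TPA": UP, "TNF-alpha": SAME},
                  effects={"IL-11": UP})

# Experiment (scenario 2): TNF-alpha raised, TPA unchanged, five expression levels observed.
experiment_2 = Scenario(
    2,
    conditions={"TNF-alpha": UP, "TPA": SAME},
    effects={"EL_nur77": UP, "EL_p19": DOWN, "EL_FL1": UP,
             "EL_IL-11": UP, "EL_p53": DOWN},
)

# Scenario 3: same stimulus with Jnk1 knocked out, i.e. controlled to be unchanged.
# The text gives no observed effects for this scenario, so none are listed.
experiment_3 = Scenario(3, conditions={"TNF-alpha": UP, "TPA": SAME, "Jnk1": SAME})
```

Keeping conditions and effects in separate dictionaries mirrors the distinction between variables that are exogenously set and variables that are merely observed.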

Background knowledge also indicates which direct causal links are plausible.
We say that variable $x_1$ is a causal parent of $x_2$ when externally changing
$x_1$ can effect a change in $x_2$ when all other variables are held unchanged. We
specify plausible parent relationships, since there may be uncertainty over any
specific causal link. In our example, we specify that each stimulus variable
(TNF-α or TPA) is a plausible parent of every kinase (Jnk1 or Jnk2), that
each kinase is a plausible parent of every transcription factor (JunD or c-Jun),
and that each transcription factor is a plausible parent of each expression
variable. The hypothesis that Jnk2 uniquely activates c-Jun is represented by the
assertion that there is no link from Jnk1 to c-Jun, which may be true or false in
any particular hypothetical causal model. Additional background knowledge
takes the form of numeric assessments, $\alpha$, that encode subjective beliefs about
model parameters, including link likelihoods, reliability of causal statements,
expected noise levels, and beliefs that particular links should be positive or
negative influences.
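As a rough illustration, the layered set of plausible links and the hypothesis about Jnk1 can be written as a set of directed pairs and a predicate over model structures. The variable lists and the function name `hypothesis_holds` are assumptions made for this sketch; in particular, the five expression variables are simply those named in the example experiment above.

```python
from itertools import product

stimuli = ["TNF-alpha", "TPA"]
kinases = ["Jnk1", "Jnk2"]
factors = ["JunD", "c-Jun"]
expression = ["EL_nur77", "EL_p19", "EL_FL1", "EL_IL-11", "EL_p53"]

# Each layer is a plausible parent of every variable in the next layer.
plausible_links = (set(product(stimuli, kinases))
                   | set(product(kinases, factors))
                   | set(product(factors, expression)))

def hypothesis_holds(model_links):
    """H: Jnk2 uniquely activates c-Jun, i.e. M has no Jnk1 -> c-Jun link."""
    return ("Jnk1", "c-Jun") not in model_links

# One candidate structure M: every plausible link except Jnk1 -> c-Jun.
model_without_link = plausible_links - {("Jnk1", "c-Jun")}
```

A candidate model structure is then any acyclic subset of `plausible_links`, and the hypothesis is true or false of each such subset.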


may phosphorylate c-Jun in specific cell types and developmental conditions. 14,15


4 From Background Knowledge to Prior Probabilities 

Background knowledge consists of plausible causal links between variables,
causal conditions $C$, known effects $E$, and numeric assessments, $\alpha$. Because
we are working in a Bayesian framework, we must transform all this
background knowledge into prior probabilities. Also, we must be able to express
causal relations like “TNF-α (directly or indirectly) up-regulates JunD” and
the results of experimental manipulations like applying TPA and knocking
out Jnk1. Traditional notations for conditional probability do not capture the
distinction between an estimate when a variable is observed and an estimate
when a variable is manipulated. Thus, we introduce the notation $P(X \mid Y : Z)$
for the probability of $X$ when $Y$ is believed or observed to be true and $Z$ is
exogenously forced to be true. With our background knowledge consisting of
$\alpha$, $E$, and $C$, we define the prior causal probability as $P(\cdot \mid E, \alpha : C)$.

A causal model structure, $M$, is an acyclic subset of directed links
between variables. Each variable $x_i$ has an associated conditional probability,
$P(x_i \mid \mathrm{par}_M(x_i), M, \theta_M)$, which specifies its probability distribution conditioned
on possible values of its parents, provided $x_i$ is not exogenously controlled. The
conditional probability is parameterized by a set of parameters, $\theta_M$, and does
not depend on the scenario, so that together an instance of $(M, \theta_M)$ defines a
Bayesian network. For example, when $M$ contains a link from Jnk1 to JunB,
$\theta_M$ includes the probabilities that JunB is up-regulated, stays the same, or
is down-regulated given that Jnk1 is up-regulated. When JunB has multiple
parents according to $M$, our parameterization combines the contribution from
each incoming link as a weighted mixture. We provide a detailed elucidation of
our specific parameterization in a supplement. 17 We handle causal intervention
by setting the local model to be
$P(x_{ij} = z \mid \mathrm{par}_M(x_{ij}), M, \theta_M : x_{ij} = z, \ldots) = 1$
whenever a variable is exogenously controlled, effectively severing the influence
from the variable’s parents. 18,3 Following the Bayesian network product rule
and combining scenarios, the joint distribution over all variables under the
causal interventions in $C$ is the product of the conditional probabilities:

$$P(X \mid M, \theta_M : C) = \prod_i \prod_j P(x_{ij} \mid \mathrm{par}_M(x_{ij}), M, \theta_M : C_{ij}) \qquad (1)$$
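To make the product rule with interventions concrete, the following sketch evaluates the joint probability of a full assignment across scenarios. It assumes plain lookup tables for the conditional probabilities rather than the weighted-mixture parameterization mentioned above, and the names `local_prob`, `joint_prob`, and the toy two-variable network are hypothetical.

```python
def local_prob(var, value, assignment, parents, cpt, controlled):
    """Probability of one variable's value in one scenario, given its parents."""
    if var in controlled:
        # An intervention severs the parents: the forced value has probability 1.
        return 1.0 if value == controlled[var] else 0.0
    parent_values = tuple(assignment[p] for p in parents.get(var, ()))
    return cpt[var][parent_values][value]

def joint_prob(scenarios, parents, cpt):
    """Product over scenarios j and variables i of the local probabilities."""
    prob = 1.0
    for assignment, controlled in scenarios:
        for var, value in assignment.items():
            prob *= local_prob(var, value, assignment, parents, cpt, controlled)
    return prob

# Toy network with a single link TPA -> Jnk2 (illustrative numbers only).
parents = {"Jnk2": ("TPA",)}
uniform = {"+": 1 / 3, "0": 1 / 3, "-": 1 / 3}
cpt = {
    "TPA": {(): uniform},
    "Jnk2": {("+",): {"+": 0.8, "0": 0.1, "-": 0.1},
             ("0",): {"+": 0.1, "0": 0.8, "-": 0.1},
             ("-",): {"+": 0.1, "0": 0.1, "-": 0.8}},
}

# Each scenario pairs a full assignment with the exogenously controlled variables.
scenarios = [
    ({"TPA": "+", "Jnk2": "+"}, {"TPA": "+"}),               # TPA applied
    ({"TPA": "+", "Jnk2": "0"}, {"TPA": "+", "Jnk2": "0"}),  # TPA applied, Jnk2 held fixed
]
result = joint_prob(scenarios, parents, cpt)
```

In the second scenario the controlled value of Jnk2 contributes a factor of 1 regardless of its parent, which is the severing effect described in the text.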

Knowing $M$ and $\theta_M$ is sufficient for estimating probabilities in situations that
involve hypothetical causal interventions, such as
$P(\text{c-Jun} = + \mid \text{EL\_IL-11} = 0, M, \theta_M : \text{TNF-}\alpha = +)$.
In terms of $M$ and $\theta_M$, we rewrite the prior causal probability as


$$P(X, M, \theta_M \mid E, \alpha : C) = P(M) \, P(\theta_M \mid \alpha) \, P(X \mid E, M, \theta_M : C), \qquad (2)$$


where $P(M)$ expands to $k \cdot \exp(-|M|)$ to provide a simplicity bias and $P(\theta_M \mid \alpha)$
consists of products of Dirichlet priors. The final term is encoded by the
Bayesian network $(M, \theta_M)$ from Equation (1).
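The two prior terms can be sketched in log space as follows. This is a minimal illustration assuming one Dirichlet prior per probability vector in the parameter set; the function names are hypothetical, and the constant $\log k$ is dropped because it is the same for every model.

```python
import math

def log_structure_prior(model_links):
    """log P(M) up to the constant log k: each link costs one unit."""
    return -float(len(model_links))

def log_dirichlet(theta, alpha):
    """Log density of a Dirichlet(alpha) prior at the probability vector theta."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def log_prior(model_links, thetas, alphas):
    """log P(M) + log P(theta_M | alpha) as a sum over independent Dirichlet terms."""
    return log_structure_prior(model_links) + sum(
        log_dirichlet(theta, alpha) for theta, alpha in zip(thetas, alphas))
```

For instance, a pseudo-count vector such as `[8, 1, 1]` over the outcomes (+, 0, −) would express a prior belief that a link is a positive influence.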

5 Evaluating Causal Biological Hypotheses 

As we have noted, biologists are often concerned with evaluating whether
experimental data, encoded by experimental conditions $I$ and observed effects
$D$, support some particular causal hypothesis, $H$. It is important to
distinguish between the degree to which (a) the data alone support the hypothesis,
(b) the data and prior knowledge together support the hypothesis, or (c) the
data support the hypothesis in the context of the prior knowledge. Classical
statistical hypothesis testing generally focuses on (a) and Bayesian analysis
usually focuses on (b), whereas we focus on (c). To this end, we first combine
the prior probability of a causal model $M$ with experimental data to determine
the posterior probability that $H$ is true, as in (b), then we utilize this to define
a p value that codifies (c). For example, a scientist may wish to publish
results demonstrating how significantly new data support (or refute) a
hypothesis in the context of what was previously known.

In our framework, a causal hypothesis is an expression that is either true
or false as a function of the model $M$, its parameters $\theta_M$, and the values of
latent variables $X$. c The hypothesis used in our example, $(\text{Jnk1}, \text{c-Jun}) \notin M$,
is a function only of $M$. If we let

$$\mathrm{prior}(H) = P(H \mid E, \alpha : C)$$
$$\mathrm{posterior}(H \mid D) = P(H \mid D, E, \alpha : I, C)$$

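the latter expression, the posterior causal probability, can be rewritten as

$$P(X, M, \theta_M \mid D, E, \alpha : I, C) = \xi \, P(M) \, P(\theta_M \mid \alpha) \, P(X, D, E \mid M, \theta_M : I, C) \qquad (3)$$

where

$$\xi = 1 \Big/ \sum_{M \in \mathcal{M}} \int P(M) \, P(\theta_M \mid \alpha) \, P(D, E \mid M, \theta_M : I, C) \, d\theta_M$$

and where $\mathcal{M}$ is the set of possible causal models.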

c We also let $H$ depend on a hypothetical control condition, which lets us evaluate the
outcomes of hypothetical causal interventions, but we have omitted this case here for the
sake of simplicity.

A posterior probability may be an optimal metric to employ in decision
making contexts, but it often does not measure the real concerns in scientific
data analysis. The posterior can be hard to interpret due to subjective
assessments in prior knowledge and does not specifically reveal the data’s support for
the conclusions. A typical data analysis question is whether the experimental
data support the hypothesis in the context of prior knowledge, which, as we
have explained, differs from whether the data and prior knowledge together
support the hypothesis. Many scientists are more familiar with the frequentist
notion of a p value, which is conventionally not applicable to such
knowledge-rich contexts. However, by combining the p value with the posterior, we obtain

$$p(H) = \Pr\left[\, \mathrm{posterior}(H \mid D') > \mathrm{posterior}(H \mid D) \;\middle|\; D' \sim \mathrm{prior}_{\neg H} \,\right]$$

where $D' \sim \mathrm{prior}_{\neg H}$ denotes that $D'$ are hypothetical observations drawn
at random from the prior with $\neg H$ enforced. As in classical statistics, the p value gives the
probability that random observations would appear to support $H$ as much as
the actual data even when $H$ does not hold, i.e., the probability that one can
be misled by the data into thinking $H$ is true when it is actually false. A
p value near zero indicates that the data provide strong support for $H$. If
$p(\neg H) \approx 0$, then the data provide strong evidence against the hypothesis.

We utilize a Metropolis-Hastings sampler 19 to compute $\mathrm{posterior}(H \mid D)$,
based directly on Equation 3. The basic strategy, which is an instance of
a Markov chain Monte Carlo algorithm, 20 rapidly samples as many plausible
models as it can in a short time. The method samples each model with a
probability that asymptotically approaches its posterior probability, and uses
the resulting counts to estimate how often the hypothesis $H$ is true. We sample
$M$, $\theta$, and $X$ simultaneously, as opposed to most other approaches to Bayesian
network induction, 2,3 which usually use one method to search over possible
model structures, $M$, a separate nested method (often based on EM) to fit
parameters, $\theta$, and yet different algorithms for inference over $X$.

Our algorithm begins with a starting model structure $M$, parameters $\theta$,
and value assignments to all latent variables, $X$. One point in the sample
space, $(M, \theta, X)$, can be conceptualized as one possible model of the underlying
biological system and how its latent variables respond in each situation. The
method then proceeds to sample new values for $z_t = (M_t, \theta_t, X_t)$, based on
$z_{t-1}$, to converge on the posterior distribution given by Equation 3.

At each step a new point, $z' = (X', M', \theta')$, is randomly proposed according
to a proposal probability $g(z' \mid z_{t-1})$, and the Hastings acceptance probability

$$a(z_{t-1}, z') = \min\left\{ 1, \; \frac{\pi(z') \, g(z_{t-1} \mid z')}{\pi(z_{t-1}) \, g(z' \mid z_{t-1})} \right\}$$

is computed, where $\pi(z)$ is the right hand side of Eq. 3, and where $z_t$ is set to
$z'$ with probability $a(z_{t-1}, z')$ and otherwise to $z_{t-1}$. The Hastings update rule
guarantees detailed balance 20 and thus that $P(z_t) \to \pi(z)$ as $t \to \infty$, provided
that $g$ can reach all points of the sample space.
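The Hastings update can be illustrated on a toy state space. The sketch below is a generic Metropolis-Hastings step, not our sampler over $(M, \theta, X)$: the three-state target, the uniform proposal, and all function names are assumptions chosen so that the example stays self-contained. Note that the target weights are left unnormalized, since only ratios enter the acceptance probability.

```python
import math
import random

def mh_step(z, log_pi, propose, log_g, rng):
    """One Metropolis-Hastings update; log_g(a, b) is the log of g(a | b)."""
    z_new = propose(z, rng)
    log_ratio = (log_pi(z_new) + log_g(z, z_new)) - (log_pi(z) + log_g(z_new, z))
    accept_prob = math.exp(min(0.0, log_ratio))  # min{1, ratio}, computed in log space
    return z_new if rng.random() < accept_prob else z

# Toy target over three states, given only up to a normalization factor.
weights = {0: 1.0, 1: 2.0, 2: 7.0}

def log_pi(z):
    return math.log(weights[z])

def propose(z, rng):
    # Jump uniformly to one of the other two states, so every state is reachable.
    return rng.choice([s for s in weights if s != z])

def log_g(z_to, z_from):
    return math.log(0.5)

def run_chain(steps, seed=0):
    rng = random.Random(seed)
    z, visits = 0, {s: 0 for s in weights}
    for _ in range(steps):
        z = mh_step(z, log_pi, propose, log_g, rng)
        visits[z] += 1
    return {s: n / steps for s, n in visits.items()}

frequencies = run_chain(20000)
```

With enough steps, the visit frequencies should approach the normalized weights (0.1, 0.2, 0.7), which is the convergence property the update rule is meant to guarantee.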


We will not specify in detail our choice for $g$, but note that it involves a
mixture of strategies, one for altering $M$ by adding or deleting links, one for
updating $\theta$, one for changing $X$, and a few more specialized methods. When a
new sample changes only a few links, parameters, or values, $\pi$ can be
updated incrementally, so that computational complexity depends on the number
of changes rather than on the size of the model. Moreover, the normalization
factors for $\pi$ and $g$ are not relevant and thus are not computed. Finally, we
follow the standard practice for Markov chain Monte Carlo algorithms and
begin with a ‘burn-in’ period before tallying statistics (usually 1/10th the
total number of iterations), and we usually tally only every 100th sample in a
100,000 sample run. Our Java-based implementation typically explores about
2,000 proposals per second on a 1.5 GHz Pentium IV, with a typical proposal
acceptance rate around 35 percent.
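The burn-in and thinning conventions can be sketched as a small tallying routine. This assumes that the burn-in is counted as part of the total run and that the chain is available as a sequence of sampled points; the function name and its parameters are hypothetical.

```python
def estimate_posterior(samples, hypothesis, burn_in_fraction=0.1, thin=100):
    """Fraction of retained samples for which the hypothesis is true."""
    samples = list(samples)
    start = int(len(samples) * burn_in_fraction)  # discard the burn-in period
    kept = samples[start::thin]                   # tally only every `thin`-th sample
    return sum(1 for z in kept if hypothesis(z)) / len(kept)

# Stand-in chain: integers play the role of sampled points z_t.
chain = range(100000)
estimate = estimate_posterior(chain, lambda z: z >= 55000)
```

Under these settings a 100,000 sample run retains 900 points, and the estimate of the posterior is simply the fraction of those points at which the hypothesis holds.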

To compute the p value for a hypothesis $H$, we first utilize the above
algorithm to draw a sample $(M, \theta_M, X)$, enforcing $\neg H$ during the process and
treating all variables in $D$ as latent (i.e., ignoring their original observed values). We
then extract $D'$ from the sampled $X$, use $D'$ in place of $D$, and run the above
procedure to compute $\mathrm{posterior}(H \mid D')$. We repeat this $N$ times, tallying the
mean and variance of $\mathrm{posterior}(H \mid D')$ across the different choices of $D'$. With
small $N$, the number of times that $\mathrm{posterior}(H \mid D') > \mathrm{posterior}(H \mid D)$ is a
poor estimate of the p value, so we instead assume $\mathrm{posterior}(H \mid D')$ is normally
distributed and use the area under the normal curve with the tallied mean and
variance to estimate the probability that $\mathrm{posterior}(H \mid D') > \mathrm{posterior}(H \mid D)$.
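The final normal-approximation step can be sketched with the Python standard library. The sketch assumes that the $N$ values of the posterior under resampled data have already been computed, uses the sample (rather than population) standard deviation, and requires at least two distinct values; the function name is hypothetical.

```python
from statistics import NormalDist, mean, stdev

def p_value_normal(null_posteriors, observed_posterior):
    """Upper-tail area of a normal fit to posterior(H | D') above posterior(H | D)."""
    mu = mean(null_posteriors)
    sigma = stdev(null_posteriors)  # sample standard deviation; must be nonzero
    return 1.0 - NormalDist(mu, sigma).cdf(observed_posterior)

# Illustrative values only: three resampled posteriors and two observed posteriors.
p_typical = p_value_normal([0.4, 0.5, 0.6], 0.5)
p_extreme = p_value_normal([0.4, 0.5, 0.6], 0.99)
```

An observed posterior equal to the mean of the resampled values yields a p value near 0.5, whereas one far above them yields a p value near zero.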

6 Experimental Evaluation 

Our basic claim is that incorporating prior knowledge into causal biological
hypothesis evaluation enhances statistical power, therefore making it possible
to infer causal effects that would otherwise be undetectable in small sets of
experimental data. To support this claim, we utilized the hypothetical, but
biologically plausible, model in Figure 1 to generate synthetic data under six
different experimental conditions. These experiments involved one wild type
organism and two knockout conditions, each under stimulation by TPA or
TNF-α. The ‘true’ model takes the same form as those described earlier, that
is, a causal Bayesian network in which each random variable has the domain
{+, 0, −} and each causal influence is stochastic (with 10% of the observed
values being altered by noise). We generated two sets of data for our evaluation:
the first assumed a model in which Jnk1 does not influence c-Jun, whereas the
second assumed this influence does occur.

Figure 1: A hypothetical regulatory system used to generate synthetic data. The variables
TNF-α and TPA are extracellular stimulus levels controlled by experimental conditions.
Jnk1 and Jnk2 denote activation levels for the respective kinases, each of which is knocked
out in two of six experiments. JunD and c-Jun, which denote activation levels of the
respective transcription factors, are unobserved in the experiments. Variables prefaced with
EL denote the observed expression levels of specific genes. Enhancement and suppression
influences are indicated with pluses and minuses, respectively.

Our first study aimed to demonstrate that prior background knowledge
lets one obtain statistical support for a causal hypothesis even when some of
the relevant variables are latent. We used background knowledge about the
JNK/JUN signaling pathways collated from published literature, subdivided
into 12 facts: c-Jun is a transcription factor (i.e., parent) of IL-11 16 and p53; 21
JunD is a transcription factor of nur77; 22 TPA up-regulates IL-11, 16 JunD, 23
c-Jun, 24 and nur77; 22 JunD up-regulates nur77; 25 c-Jun up-regulates IL-11 16
and down-regulates p53; 21 and Jnk1 and Jnk2 are kinases, so they positively
influence their targets. We used this knowledge in evaluating two hypotheses

H1: There is no direct causal link from Jnk1 to c-Jun.

H2: There is a direct causal link from Jnk1 to c-Jun.

against the two data sets. We performed each evaluation twice, once using the 
entire corpus of background knowledge, and once using no prior background 
knowledge (except for the set of variables and plausible links). Table 1 shows 
the posterior probabilities and p values that resulted from these runs. 

Statistically significant support for the correct hypothesis occurs at the
0.05 level only for the two cases that utilize background knowledge. From the
data alone, neither hypothesis is supported at a significant level. Since c-Jun
and JunD are both latent, the knowledge is necessary to relate them to the
observed data; otherwise, the system finds no support for the hypotheses that
involve those variables.


Table 1: Statistical significance levels for causal hypotheses with and without the
incorporation of background knowledge. The first number denotes the posterior probability of each
hypothesis, whereas the second denotes the p value.

                 All Prior Knowledge                  No Prior Knowledge
True Model       H1: no link      H2: link            H1: no link      H2: link
No Influence     0.99 / 0.003     0.01 / 0.77         0.42 / 0.46      0.58 / 0.51
Influence        0.03 / 0.500     0.97 / < 10^-7      0.44 / 0.55      0.56 / 0.44

Figure 2 shows the results of a more extensive study, using only data
generated by the model in which Jnk1 does not influence c-Jun, that relates
statistical power to the amount of background knowledge available. In these
runs, we ordered the 12 facts randomly and evaluated the hypothesis H1 (that
the link is not present) while varying the number of facts given to the system
from 0 to 12. The graphs reveal that a certain amount of background
knowledge, about half the corpus, must be available before strong support for the
hypothesis becomes evident. The sixth fact, which happened to be that c-Jun
influences EL_IL-11, was a major clue the system needed to disambiguate
between two main competing systems-level scenarios that appeared possible prior
to this fact. With nine or more facts, the system detects that the data sets
indeed support the hypothesis statistically at the 0.05 level.

Without this ample biological knowledge, it would have been impossible 
to validate or refute the hypotheses from experimental data alone. However, 
when combined with such prior information, the data reveal clear support for 
the hypothesis at statistically significant levels. These results are consistent 
with our central thesis — that utilizing preexisting background knowledge can 
be crucial for effective hypothesis testing when only small samples are available 
or the variables are only partially observable. 

7 Discussion 

Our approach to interpreting biological data contrasts sharply with most com- 
putational work in this area. The most common methodology passes available 
data to an induction algorithm, which extracts regularities that, hopefully, are 
biologically meaningful. However, work in this paradigm typically assumes 
data are plentiful and makes little use of background knowledge about biology. 
Our framework instead takes advantage of prior knowledge and previous ex- 
perimental results in analyses of new observations, increasing their statistical 
power even when few samples are available. 

Figure 2: Evidence that statistical power increases with available background knowledge, as
reflected by the number of biological facts provided to the system. The data set supports
the (correct) hypothesis at a significant level (p(H1) < 0.05) only when at least nine of the
12 domain facts are utilized during evaluation.

Despite this crucial difference, our approach has clear links to earlier
research in computational biology. The strongest connection is to methods for
learning Bayesian networks, 2,26,27,28 which also search a space for causal models
that match the experimental data. For the special case in which each possible
link is a hypothesis, we can view these systems as evaluating causal hypotheses
with respect to their support by the data. Previous efforts within this
framework have also dealt with the technical complications that arise with latent
variables 29,27 and causal interventions, 3,18 both addressed in our own work.

Nevertheless, most research on inducing Bayesian networks of biological
systems has taken a knowledge-lean, data-mining approach. Despite a few
exceptions that incorporate knowledge about promoter sequences, 4,30,31
typical work in this paradigm attempts to construct a causal model from scratch,
rather than evaluate particular causal relations in the context of background
knowledge. Our own previous research on computational methods for revising
causal biological models 5 comes much closer, but still emphasizes model
discovery rather than hypothesis evaluation. A few researchers 32,33 have focused
on model evaluation as opposed to model discovery. In particular, the JustAid
system 32 also supports the use of experimental data to evaluate qualitative
causal hypotheses, although its underlying algorithms are quite different and
it addresses neuroendocrinology rather than gene regulation.

At a computational level, our approach draws heavily on Markov chain
Monte Carlo techniques for hypothesis testing from the statistical literature. 34
However, our use of p values in a Bayesian context appears novel, in that
the standard approach to Bayesian hypothesis testing involves comparing the
Bayes Factors 34 or Bayesian Information Criteria 35 for alternative models.
Both approaches are legitimate from a statistical perspective, but we have
chosen to utilize p values because they are generally more familiar to biologists.

Despite our encouraging results, we must extend our computational frame- 
work along a number of dimensions before it can become a useful tool for 
biologists. For example, we should explore other representations that let us 
encode qualitative causal knowledge about biological systems, especially no- 
tations that make stronger contact with established biological concepts like 
phosphorylation and dimerization. Moreover, we should incorporate this ex- 
tended formalism into a user interface that lets biologists visualize and manage 
their background knowledge, hypotheses, and experimental data. 

Another limitation of our current implementation concerns efficiency, in 
that its sampling strategy does not scale well to very large corpora of back- 
ground knowledge. Also, since our posterior distributions often exhibit isolated 
regions of high probability, achieving a workable mixing rate from the Hast- 
ings algorithm is challenging. In future work, we plan to make our inference 
methods more efficient by incorporating additional ideas from the literature on 
Markov chain Monte Carlo and Metropolis-Hastings algorithms. 36,20 Our ap- 
proach would also benefit from computational methods that generate plausible 
hypotheses automatically by reasoning over biological knowledge. 37 Finally, we 
must demonstrate the utility of our framework on data from actual biological 
experiments and on hypotheses they were designed to test. 

In summary, we have presented a computational approach to evaluating 
causal hypotheses that takes advantage of background knowledge and previ- 
ous experimental results to increase statistical power. Our framework encodes 
biological knowledge, hypotheses, and data in terms of qualitative relations 
between variables, and it utilizes Bayesian inference to calculate the evidence 
for and against each candidate hypothesis. We illustrated this approach to hy- 
pothesis evaluation in the context of knowledge and data about the JNK/JUN 
signaling pathways, and we demonstrated the increase in statistical power that 
background knowledge provides in this setting. We believe that the techniques 
we have described will make future tools for computational biology more robust 
and let them use available data more effectively. 